Debiasing SHAP scores in random forests
Authors
Abstract
Black-box machine learning models are currently used for high-stakes decision making in various parts of society, such as healthcare and criminal justice. While tree-based ensemble methods such as random forests typically outperform deep learning on tabular data sets, their built-in variable-importance algorithms are known to be strongly biased toward high-entropy features. It was recently shown that the increasingly popular SHAP (SHapley Additive exPlanations) values suffer from a similar bias. We propose debiased, or "shrunk", SHAP scores based on sample splitting, which additionally enable the detection of overfitting issues at the feature level.
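The bias described above, and the sample-splitting remedy, can be illustrated with a minimal sketch. This is an assumption-laden analogue, not the authors' algorithm: it uses scikit-learn's permutation importance on a held-out split in place of SHAP values, and all variable names are hypothetical.

```python
# Sketch only: sample splitting to shrink biased importance scores.
# A pure-noise, high-entropy feature receives nonzero in-sample impurity
# importance, while importance measured on a held-out split shrinks toward 0.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)   # informative feature
noise = rng.normal(size=n)    # high-entropy feature unrelated to the target
X = np.column_stack([signal, noise])
y = signal + 0.5 * rng.normal(size=n)

# Sample splitting: fit on one half, evaluate importance on the other.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# In-sample impurity importance assigns the noise feature nonzero weight.
print("impurity importance:", rf.feature_importances_)

# Held-out permutation importance shrinks the noise feature toward zero.
pi = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print("held-out importance:", pi.importances_mean)
```

A persistently large gap between the in-sample and held-out scores for a feature is also a per-feature overfitting signal, in the spirit of the feature-level diagnostics the abstract mentions.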
Similar resources
Forests in Random Graphs
Let G be a graph and let I(G) be defined by I(G) = max{|F| : F is an induced forest in G}. Let d = (d1, d2, . . . , dn) be a graphic degree sequence such that d1 ≥ d2 ≥ · · · ≥ dn ≥ 1. By using the probabilistic method, we prove that if G is a graph with degree sequence d, then I(G) ≥ 2 n ∑ ...
Random Forests – Random Features
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The error of a forest of tree classifiers depends on the strength of the individual tre...
Mondrian Forests: Efficient Online Random Forests
Ensembles of randomized decision trees, usually referred to as random forests, are widely used for classification and regression tasks in machine learning and statistics. Random forests achieve competitive predictive performance and are computationally efficient to train and test, making them excellent candidates for real-world prediction tasks. The most popular random forest variants (such as ...
Random Forests in Language Modeling
In this paper, we explore the use of Random Forests (RFs) (Amit and Geman, 1997; Breiman, 2001) in language modeling, the problem of predicting the next word based on words already seen before. The goal in this work is to develop a new language modeling approach based on randomly grown Decision Trees (DTs) and apply it to automatic speech recognition. We study our RF approach in the context of ...
Random Composite Forests
We introduce a broad family of decision trees, Composite Trees, whose leaf classifiers are selected out of a hypothesis set composed of p subfamilies with different complexities. We prove new data-dependent learning guarantees for this family in the multi-class setting. These learning bounds provide a quantitative guidance for the choice of the hypotheses at each leaf. Remarkably, they depend o...
Journal
Journal title: AStA Advances in Statistical Analysis
Year: 2023
ISSN: 1863-8171, 1863-818X
DOI: https://doi.org/10.1007/s10182-023-00479-7